Exploratory Data Analysis

Authors

Josh Sweren

Tori Edmunds

Ahmad Cheema

Idriss Moluh

1 Summary

The Exploratory Data Analysis (EDA) phase of this project focuses on identifying patterns and relationships within soccer match data and fan discourse from Reddit. By combining structured match metrics with unstructured online sentiment, this analysis bridges the gap between statistical performance indicators and public opinion. Key findings reveal:

  • Clear differences in fan sentiment during wins, losses, and draws.
  • Significant correlations between match and player metrics (e.g., goals) and fan engagement levels on Reddit.
  • Temporal patterns in fan discussions, with peaks occurring during key tournament phases.

2 Data Overview

The dataset integrates two primary sources:

Match Metrics: Comprehensive data from UEFA Euro 2024 and Copa América 2024, covering:

  • Match outcomes (win, loss, draw).
  • Player-specific metrics, such as goals scored, assists, and disciplinary records.
  • Match-level metadata, including attendance and venue.

Reddit Data:

  • General football-focused communities: r/soccer, r/euro2024, r/CopaAmerica
  • National team subreddits: r/spain, r/france, r/netherlands, r/england, r/argentina, r/colombia, r/uruguay, r/brazil, r/turkey, r/switzerland, r/germany, r/portugal, r/italy, r/austria, r/romania, r/slovenia, r/sakartvelo, r/belgium, r/slovakia, r/denmark, r/canada, r/venezuela, r/panama, r/ecuador
  • Includes thousands of posts and comments analyzed for sentiment polarity and emotional tone using Spark NLP.

2.1 Data Cleaning & Preparation

Reddit Data:

  • Filtering Noise: Removed non-soccer-related content, off-topic discussions, and spam.
  • Sentiment Normalization: Standardized sentiment scores across different subreddit communities to ensure comparability.
  • Time Alignment: Synced Reddit posts with match timestamps to map fan sentiment to specific game moments.
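The report does not say how sentiment scores were standardized, so the following is only a minimal sketch of one common approach: a z-score computed within each subreddit. The function name `standardize_by_subreddit`, the field names, and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_by_subreddit(rows):
    """Return copies of rows with a z-scored sentiment per subreddit."""
    by_sub = defaultdict(list)
    for row in rows:
        by_sub[row["subreddit"]].append(row["sentiment"])

    # Mean and population standard deviation of each community.
    stats = {sub: (mean(v), pstdev(v)) for sub, v in by_sub.items()}

    out = []
    for row in rows:
        mu, sigma = stats[row["subreddit"]]
        # A community with constant scores has sigma == 0; map those to 0.0.
        z = (row["sentiment"] - mu) / sigma if sigma > 0 else 0.0
        out.append({**row, "sentiment_z": z})
    return out


# Hypothetical sample values, for illustration only.
sample = [
    {"subreddit": "soccer", "sentiment": 0.2},
    {"subreddit": "soccer", "sentiment": 0.6},
    {"subreddit": "england", "sentiment": -0.5},
    {"subreddit": "england", "sentiment": -0.1},
]
standardized = standardize_by_subreddit(sample)
```

After this step a score of 1.0 means "one standard deviation more positive than is typical for that community", which is what makes scores comparable across subreddits with different baseline tones.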

Match & Player Data:

  • Handling Missing Values: Filled gaps in data, such as attendance figures, using tournament averages where appropriate.
  • Cross-Tournament Standardization: Adjusted for differences in tournament formats (e.g., group stages vs. knockout rounds).
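As an illustration of the missing-value handling described above, here is a small sketch that fills a missing attendance figure with the average of its own tournament and flags the filled value. The field names, the `imputed` flag, and the sample figures are assumptions made for this example, not the project's actual schema.

```python
from collections import defaultdict
from statistics import mean


def fill_attendance(matches):
    """Replace missing attendance (None) with the mean of its tournament."""
    known = defaultdict(list)
    for m in matches:
        if m["attendance"] is not None:
            known[m["tournament"]].append(m["attendance"])
    averages = {t: mean(v) for t, v in known.items()}

    filled = []
    for m in matches:
        if m["attendance"] is None:
            # If a tournament has no known values at all, the gap stays None.
            value = averages.get(m["tournament"])
            filled.append({**m, "attendance": value, "imputed": value is not None})
        else:
            filled.append({**m, "imputed": False})
    return filled


# Hypothetical rows, for illustration only.
rows = [
    {"tournament": "euro", "attendance": 60000},
    {"tournament": "euro", "attendance": 50000},
    {"tournament": "euro", "attendance": None},
    {"tournament": "copa", "attendance": 40000},
]
filled_rows = fill_attendance(rows)
```

Keeping an explicit flag makes it possible to exclude imputed rows later, for example when comparing attendance across rounds.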

3 Datasets

3.1 Soccer Subreddit

To assess how prominent UEFA Euro 2024 and Copa América 2024 were on Reddit, we first look at the total volume of comments over time on the widely used soccer subreddit. Because the subreddit is heavily moderated and enforces strict naming conventions for individual submissions, submission volume reflects moderator intervention as much as user activity, so we treat comment count as the more representative measure of platform usage. Our dataset ranges from May 2023 to August 2024, and the figure below shows that the period in which the two tournaments overlapped (denoted by the red dotted lines) coincided with a noticeable and sustained uptick in comment volume. The plot uses a 7-day moving average of total daily comments so that it captures traffic over the tournament period as a whole rather than spikes on specific match days. The peak is notable because most major domestic leagues have their offseason during the summer, so the increase came during a period with fewer matches worldwide. It is therefore reasonable to infer that the overlap of two major international tournaments drove increased traffic on the soccer subreddit.
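The smoothing step can be sketched as follows. This is a minimal example that assumes a trailing window (each value averages the current day and the six days before it); the report does not state whether a trailing or centered window was used, and the function name and sample counts are hypothetical.

```python
from collections import deque


def moving_average(daily_counts, window=7):
    """Trailing moving average; the first window-1 days yield None."""
    out = []
    recent = deque(maxlen=window)
    running = 0
    for count in daily_counts:
        if len(recent) == window:
            # The oldest value is about to be evicted from the window.
            running -= recent[0]
        recent.append(count)
        running += count
        out.append(running / window if len(recent) == window else None)
    return out


# Hypothetical daily comment counts, for illustration only.
smoothed = moving_average([1, 2, 3, 4, 5, 6, 7, 8])
```

With a 7-day window every smoothed value covers exactly one of each weekday, which removes the weekly rhythm of match days from the curve.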

Reddit users from England have the highest participation in the soccer subreddit. These users contributed 4,422 comments with an average score of 14, indicating a generally favorable reception to their posts. They also made a moderate number of controversial submissions (238), with an average controversiality of 0.05. In contrast, users from the Netherlands submitted 872 comments with an average score of 8, reflecting lower positive engagement. They also show a higher average controversiality of 0.07, indicating a tendency toward more polarizing content. Users from Spain contributed 837 comments with the highest average score of 22, suggesting their posts received the most favorable reception, and their average controversiality of 0.04 is relatively low, showing that their content is less divisive. Of these four groups, France has the lowest engagement, with only 544 comments and an average score of 11. Their controversial count (8) and average controversiality (0.01) are the lowest across all groups. This data suggests that user origin may influence both the volume of engagement and the tone of interactions on the platform.

Author Origin Comment Count Average Score Controversial Count Average Controversiality

Reddit users from Argentina demonstrated the highest participation in the Copa América discussions within the soccer subreddit, contributing 1,415 comments with an average score of 15. This suggests their posts were generally well-received, although their controversial submissions (111) and average controversiality of 0.08 highlight some divisive topics. Canadian users followed closely with 1,245 comments, but their average score of 8 indicates a relatively lower level of positive engagement. With 64 controversial submissions and an average controversiality of 0.05, their discussions were moderately contentious. Colombian users contributed 773 comments, matching Argentina’s average score of 15, suggesting similar favorability in their contributions. However, their low controversial count (23) and average controversiality of 0.03 reveal less divisive content overall. Finally, Uruguayan users, despite having the fewest comments (139), achieved the highest average score of 30, signaling highly impactful posts. However, their average controversiality of 0.12 and controversial count (17) suggest a more polarizing nature in their engagement. These insights illustrate varying levels of participation, sentiment, and divisiveness among fan bases in Copa América discussions.

Author Origin Comment Count Average Score Controversial Count Average Controversiality

3.2 Country Subreddits

In addition to analyzing soccer-specific corners of Reddit, we gathered data from the subreddit of each country that reached the knockout rounds of either tournament. Many of these communities contain comments and submissions primarily in languages other than English, which made text analysis difficult, so we instead gauged the volume of soccer discussion by checking whether r/soccer users were also active in country subreddits. By computing the proportion of comments in each subreddit that came from r/soccer users, we can see whether the period of Euro 2024 and Copa América 2024 drew soccer fans into country subreddits.
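The proportion described here can be sketched as below. This is an illustrative example only: the function name `soccer_user_share`, the field names, and the sample authors are hypothetical, and the set of r/soccer authors is assumed to have been collected beforehand.

```python
from collections import Counter


def soccer_user_share(comments, soccer_authors):
    """Share of each subreddit's comments written by known r/soccer authors."""
    total = Counter()
    from_soccer = Counter()
    for c in comments:
        total[c["subreddit"]] += 1
        if c["author"] in soccer_authors:
            from_soccer[c["subreddit"]] += 1
    return {sub: from_soccer[sub] / n for sub, n in total.items()}


# Hypothetical data, for illustration only.
soccer_authors = {"user_a", "user_b"}
comments = [
    {"subreddit": "england", "author": "user_a"},
    {"subreddit": "england", "author": "user_a"},
    {"subreddit": "england", "author": "user_c"},
    {"subreddit": "england", "author": "user_d"},
    {"subreddit": "france", "author": "user_c"},
    {"subreddit": "france", "author": "user_d"},
]
shares = soccer_user_share(comments, soccer_authors)
```

Running the same function separately on comments from inside and outside the tournament window gives the before-and-after comparison discussed in this section.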

We found that for most country subreddits the proportion decreased during the tournament period, and some countries had no r/soccer users present at all. The figure below illustrates this, and indicates that NLP and ML tasks are likely to be more effective on communities in which we know users are discussing soccer. England is an exception: a large proportion of r/england commenters are also r/soccer commenters, and that proportion increased during the tournament season. Otherwise, international competition did not appear to entice soccer fans to bring their traffic to country subreddits.

3.3 UEFA Euro Match Data

The match data was derived from FBRef and includes information about the Home Score, Away Score, and Attendance.

In the group stage, top teams such as Germany, Spain, and the Netherlands asserted their dominance, with Germany securing a commanding 5–1 victory over Scotland and Spain achieving a 3–0 win against Croatia. Meanwhile, Switzerland, Slovakia, and Ukraine delivered notable results, with Switzerland defeating Hungary 3–1 and Ukraine overcoming Slovakia in a 2–1 result. The group stage also featured a number of closely contested draws, including goalless stalemates between Denmark and Serbia, and England and Slovenia, reflecting the tournament’s overall competitiveness.

As the tournament progressed into the Round of 16, the intensity heightened. Spain continued their strong form with a dominant 4–1 victory over Georgia, while France narrowly defeated Belgium 1–0. Several matches, such as the 0–0 draw between Portugal and Slovenia, underscored the high stakes of the knockout rounds, with Portugal advancing after a penalty shootout. In the quarter-finals, Spain edged past Germany 2–1, and the Netherlands overcame Türkiye by the same scoreline, further solidifying their positions as strong contenders. The quarter-final clash between Portugal and France was particularly memorable, as France triumphed 5–3 in a penalty shootout following a 0–0 draw after regular and extra time.

The semi-finals saw Spain overcome France 2–1, while England triumphed 2–1 over the Netherlands, setting up a highly anticipated final between Spain and England. The final, held on July 14, 2024, concluded with Spain securing a 2–1 victory, capturing their fourth European Championship title.

Date Round Score Home Away Attendance Home Score Away Score Winning Team Losing Team

3.4 Copa América Match Data

The match data was derived from FBRef and includes information about the Home Score, Away Score, and Attendance.

The group stage began with a dominant 2-0 win for Argentina over Canada, while Venezuela claimed a 2-1 victory over Ecuador. Other notable matches included a goalless draw between Peru and Chile and a 3-1 victory for Uruguay over Panama. The group stage also saw Brazil held to a draw by Costa Rica, and Mexico narrowly defeating Jamaica 1-0.

As the group stage progressed, upsets and exciting moments continued to unfold. Canada defeated Peru 1-0, while Argentina edged Chile with a 1-0 win. Venezuela continued their strong run, defeating Mexico 1-0. In another surprising result, Panama triumphed over the United States 2-1. Later in the group stage, Brazil secured a 4-1 win against Paraguay, and Argentina finished their group campaign with a 2-0 win over Peru.

Moving into the quarter-finals, the matches were equally tense and closely contested. Argentina played to a 1-1 draw against Ecuador, while Colombia overwhelmed Panama with a commanding 5-0 victory. Venezuela and Canada also played out a 1-1 draw, and Uruguay and Brazil were held to a goalless stalemate. Three of the four quarter-finals therefore went to penalty shootouts to decide who advanced to the semi-finals.

In the semi-finals, Argentina overcame Canada 2-0, and Colombia triumphed 1-0 over Uruguay, setting up an exciting final. The third-place match between Canada and Uruguay ended 2-2, with Uruguay winning on penalties. The final, held on July 15, 2024, saw Argentina secure a narrow 1-0 win over Colombia, capturing their 16th Copa América title.

Date Round Score Home Away Attendance Home Score Away Score Winning Team Losing Team

3.5 Player Statistics

This data from FBRef highlights the top scorers in UEFA Euro 2024, consisting of players who scored more than one goal during the tournament. Six players share the lead with 3 goals each: Cody Gakpo (Netherlands), Dani Olmo (Spain), Georges Mikautadze (Georgia), Harry Kane (England), Ivan Schranz (Slovakia), and Jamal Musiala (Germany). Among them, Harry Kane, at 30 years old, stands out as one of the more experienced players, while Jamal Musiala, at just 20 years old, represents a younger generation of emerging talent. Several other players contributed 2 goals, including Breel Embolo (Switzerland), Donyell Malen (Netherlands), Fabián Ruiz Peña (Spain), and Florian Wirtz (Germany), among others. Notably, Jude Bellingham (England), at 20 years old, and Kai Havertz (Germany), at 24, continue to showcase their potential on the international stage. The top scorers in this table reflect a blend of seasoned stars and rising talents, contributing to the excitement and competitiveness of the tournament.

Player              Pos    Age  Country      Goals
Cody Gakpo          FW     24   Netherlands  3
Dani Olmo           MF,FW  25   Spain        3
Georges Mikautadze  FW     23   Georgia      3
Harry Kane          FW     30   England      3
Ivan Schranz        FW     30   Slovakia     3
Jamal Musiala       FW     20   Germany      3
Breel Embolo        FW     26   Switzerland  2
Donyell Malen       FW     25   Netherlands  2
Fabián Ruiz Peña    MF     27   Spain        2
Florian Wirtz       FW,MF  20   Germany      2
Jude Bellingham     MF,FW  20   England      2
Kai Havertz         FW,MF  24   Germany      2
Merih Demiral       DF     25   Türkiye      2
Niclas Füllkrug     FW     30   Germany      2
Nico Williams       FW     21   Spain        2
Răzvan Marin        MF     27   Romania      2

This data from FBRef provides an overview of players who scored at least one goal in UEFA Euro 2024, aggregated by country to offer insight into each team's offensive performance. Here, the player count is the number of distinct goal scorers, not the size of the squad. Spain leads with the highest total goals scored (14) spread across 10 scorers, indicating a well-rounded offensive effort. Germany follows with 11 goals from 6 scorers, showcasing strong team contributions. England has 8 goals from only 5 scorers, reflecting an attack concentrated in fewer players. Other countries such as Italy, Switzerland, and Türkiye also demonstrate solid goal-scoring performances, while countries like Serbia and Scotland have fewer goals, though Serbia stands out with a shots-on-target percentage of 100%. Among the larger groups of scorers, Italy leads with the highest shots-on-target percentage (72.2%), while Denmark and Portugal have lower percentages. The average age of the scorers is relatively young, with England having the youngest group at an average age of 24. The table also shows variation in average shooting distance, with teams like Serbia opting for shorter-range shots, while others like Denmark take longer-range efforts. Overall, the data highlights diverse offensive strategies and varying levels of goal-scoring efficiency across the participating countries.

Country Player Count Average Age Total Goals Scored Shots On Target Average Distance

4 Goals & Attendance

4.1 UEFA Euro Goals

Comparing each national team's total goals, games played, and average expected goals (xG) per game highlights differences between actual performance and expectation. Teams such as Spain and Germany stand out for their goal-scoring ability relative to their expected goals. Spain, with 15 goals over 7 games (about 2.1 per game) and an average xG of 1.69, outperformed its xG, suggesting effective finishing and a higher-than-expected conversion rate. Similarly, Germany’s 10 goals in 4 games (2.5 per game), paired with an average xG of 2.07, highlights their ability to capitalize on attacking chances and exceed their expected output. These teams demonstrate efficiency in front of goal, possibly indicating superior tactical execution or individual player quality.

Other teams in the dataset, such as Scotland and Serbia, show much weaker attacking output. Scotland scored only 2 goals in 3 games on an average xG of just 0.23 per game, which points less to poor finishing than to a lack of quality chances created. Serbia, with just 1 goal in 3 games and an average xG of 0.58 (roughly 1.7 expected goals in total), did underperform its xG and appears to have struggled to convert opportunities into goals. These cases underscore the importance of both creating scoring chances and finishing them effectively. Low-scoring teams may need to focus on chance creation or finishing, while the teams outperforming their xG might benefit from further tactical refinement to sustain their high output over a longer period.
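The comparison between actual and expected output can be made explicit with a short calculation. The sketch below uses only the goals, games, and average xG figures quoted in this section, and assumes that the xG values are per-game averages; the function name `xg_difference` is hypothetical.

```python
def xg_difference(goals, games, avg_xg):
    """Goals per game minus average xG per game.

    A positive value means the team scored more than its chances implied,
    a negative value means it scored less.
    """
    return goals / games - avg_xg


# (total goals, games played, average xG per game) as quoted in the text.
teams = {
    "Spain": (15, 7, 1.69),
    "Germany": (10, 4, 2.07),
    "Scotland": (2, 3, 0.23),
    "Serbia": (1, 3, 0.58),
}

diffs = {name: xg_difference(*figures) for name, figures in teams.items()}

for name, diff in sorted(diffs.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:10s} {diff:+.2f} goals per game vs. xG")
```

Reading the difference alongside the raw xG separates two distinct problems: a low xG indicates few quality chances, while a negative difference indicates chances that were not converted.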

4.2 UEFA Euro Attendance

Attendance trends during the 2024 UEFA Euro Championship indicate a clear correlation between match significance and crowd size. High-profile games, particularly semi-finals and the final, consistently attracted larger audiences, with venues such as the Allianz Arena and Olympiastadion Berlin seeing attendance figures of over 60,000 spectators. This trend aligns with the high demand for tickets to major fixtures, highlighting the direct impact of match importance on fan turnout.

Venue selection also played a critical role in attendance patterns, with certain stadiums like Signal Iduna Park and Veltins-Arena regularly hosting large crowds. Notably, prime time match scheduling, particularly in the evening, contributed to higher attendance rates, as seen in fixtures like Spain vs. Italy at Veltins-Arena. These findings underscore the interplay between venue capacity, match timing, and the overall significance of the fixture in determining fan attendance.

4.3 Copa América Goals

The data reveals notable differences in offensive performance across teams. Argentina, the eventual champion, scored 9 goals in 6 games. Its average expected goals (xG) of 2.26 per match indicates strong chance creation, although its actual rate of 1.5 goals per game fell short of that figure. In contrast, teams like Chile and Peru, who did not score any goals, had lower xG values, indicating that they struggled both to create and to convert chances.

Colombia and Uruguay were the tournament's highest scorers, with 12 and 11 goals respectively, while Brazil scored 5. All three posted solid average xG values: Brazil (1.6), Colombia (1.58), and Uruguay (1.54), indicating consistent offensive production throughout the tournament. On the other hand, countries like Costa Rica, Jamaica, and Panama had relatively lower xG values, highlighting their more defensive or less effective attacking performances. The gap between total goals and xG also shows how teams like Mexico, with 1 goal from 3 games and an average xG of 1.9, underperformed in finishing their opportunities.

4.4 Copa América Attendance

The 2024 Copa América matches took place at a variety of prominent venues across the United States, with stadiums such as AT&T Stadium, Allegiant Stadium, and Hard Rock Stadium hosting some of the key games. Reported crowds varied widely, with 15,625 at Children’s Mercy Park and over 70,000 at Levi’s Stadium and Bank of America Stadium. This spread reflects the varying levels of interest in different matches, with high-stakes games, like Argentina vs. Colombia in the final, generally attracting larger crowds than group-stage encounters featuring less popular teams. Team popularity mattered as well: the largest attendance of 81,106 was recorded at MetLife Stadium during a group-stage match between Argentina and Chile.

The attendance figures suggest an association between match importance and crowd size, although venue capacity also limits what each match can draw. For example, the final between Argentina and Colombia at Hard Rock Stadium drew 65,300 spectators, while the semi-finals and quarter-finals also saw significant turnout, with venues like NRG Stadium and Allegiant Stadium hosting around 50,000 to 70,000 fans. On the other hand, less anticipated matchups, such as Costa Rica vs. Paraguay at Q2 Stadium, saw relatively smaller audiences, with only 12,765 attending. The overall venue variety and attendance patterns suggest that larger, more established stadiums drew bigger crowds for the most crucial matches, enhancing the tournament’s atmosphere and experience for both players and fans.

5 Linking Reddit Data

Using data from FBRef and the soccer subreddit, we filtered comments based on whether they occurred during or immediately after a match in which the commenter’s flair matched a competing country. For example, if a commenter with an “England” user flair comments within four hours after the start time of an England match, that comment becomes part of our final dataset. We can then record whether the commenter’s team won the match on which they were commenting, providing a framework for future NLP and ML tasks on the predictive ability of comments on real-world performance.
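The filtering rule can be sketched as follows. This is a simplified illustration, not the project's pipeline: the field names, the sample kickoff time, and the choice to treat a match with no winner as "did not win" are all assumptions.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=4)  # four hours after kickoff, as described above


def link_comments(comments, matches, window=WINDOW):
    """Keep comments whose flair matches a team playing within the window."""
    linked = []
    for c in comments:
        for m in matches:
            in_window = m["kickoff"] <= c["created"] <= m["kickoff"] + window
            if in_window and c["flair"] in (m["home"], m["away"]):
                linked.append({
                    **c,
                    "match": f'{m["home"]} vs {m["away"]}',
                    # Assumption: winner is None for a draw, so this is False.
                    "team_won": m["winner"] == c["flair"],
                })
                break  # a comment is attached to at most one match
    return linked


# Hypothetical timestamps and comments, for illustration only.
matches = [{"home": "Spain", "away": "England", "winner": "Spain",
            "kickoff": datetime(2024, 7, 14, 19, 0)}]
comments = [
    {"id": 1, "flair": "England", "created": datetime(2024, 7, 14, 20, 30)},
    {"id": 2, "flair": "England", "created": datetime(2024, 7, 14, 23, 30)},
    {"id": 3, "flair": "France", "created": datetime(2024, 7, 14, 20, 0)},
]
linked = link_comments(comments, matches)
```

In this example only the first comment survives: the second falls outside the four-hour window and the third has a flair that does not match either team.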

The figure below shows the volume of comments for each match, partitioned by whether it was made by a fan of the winning or losing team. From this, we can make a few observations. Firstly, matches involving England appear to result in the greatest volume. This is unsurprising due to the lack of a language barrier, a large population, and a deep history of soccer fandom. On the Copa América side, matches involving Argentina result in heavy traffic, likely due to their involvement in the final and the stardom of Lionel Messi. Finally, the prevalence of Canada fans as the winner against Venezuela and the loser against Argentina is hard to miss, and is likely illustrative of Reddit being a site predominantly used by North Americans.

In addition to lining up comments with match times, we used the timing of submissions on the soccer subreddit to determine the time of events within a match, including goals and the end of a match. This allowed us to append to the comment dataset information on whether or not the commenter’s team was winning at the time of their comment, as well as the most recent event at the time of the comment. These are important because they provide context to what was happening at the time, and will allow for more granular analysis down the road.
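One way to attach this context to a comment is to look up every event that happened before it. The sketch below assumes a time-sorted list of events per match with a `kind` of either "goal" or "full_time"; the function name `match_state`, the labels it returns, and the sample timeline are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def match_state(events, when, team):
    """Return (status, last_event) for `team` at time `when`.

    `events` must be sorted by time; goal events carry the scoring team.
    """
    times = [e["time"] for e in events]
    past = events[:bisect_right(times, when)]  # events at or before `when`

    scored = sum(1 for e in past if e["kind"] == "goal" and e["team"] == team)
    conceded = sum(1 for e in past if e["kind"] == "goal" and e["team"] != team)

    if scored > conceded:
        status = "winning"
    elif scored < conceded:
        status = "losing"
    else:
        status = "drawing"

    last_event = past[-1]["kind"] if past else "kickoff"
    return status, last_event


# Hypothetical event timeline, for illustration only.
events = [
    {"time": datetime(2024, 7, 14, 20, 5), "kind": "goal", "team": "Spain"},
    {"time": datetime(2024, 7, 14, 20, 30), "kind": "goal", "team": "England"},
    {"time": datetime(2024, 7, 14, 20, 45), "kind": "goal", "team": "Spain"},
    {"time": datetime(2024, 7, 14, 20, 55), "kind": "full_time", "team": None},
]
state = match_state(events, datetime(2024, 7, 14, 20, 10), "England")
```

Because event times are inferred from submission timestamps, they lag the real events slightly, so comments posted within moments of a goal may be assigned the previous state.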

The following three visuals show the volume of comments for whether or not the commenter’s team won, the status of the match at the time of commenting, and the most recent event at the time of commenting. The main takeaway from this is that we do see significantly more comments when the user’s team is experiencing positive results than negative.

6 EDA Code Repository

Explore the code for our Exploratory Data Analysis here.